DocETL: Agentic Query Rewriting and Evaluation for Complex Document Processing

https://doi.org/10.14778/3746405.3746426

Shankar, Shreya; Chambers, Tristan; Shah, Tarak; Parameswaran, Aditya G; Wu, Eugene (May 2025, Proceedings of the VLDB Endowment)

Analyzing unstructured data has been a persistent challenge in data processing. Recent proposals offer declarative frameworks for LLM-powered processing of unstructured data, but they typically execute user-specified operations as-is in a single LLM call, focusing on cost rather than accuracy. This is problematic for complex tasks, where even well-prompted LLMs can miss relevant information. For instance, reliably extracting all instances of a specific clause from legal documents often requires decomposing the task, the data, or both. We present DocETL, a system that optimizes complex document processing pipelines while accounting for LLM shortcomings. DocETL offers a declarative interface for users to define such pipelines and uses an agent-based approach to automatically optimize them, leveraging novel agent-based rewrites (that we call rewrite directives), as well as an optimization and evaluation framework. We introduce (i) logical rewriting of pipelines, tailored for LLM-based tasks, (ii) an agent-guided plan evaluation mechanism, and (iii) an optimization algorithm that efficiently finds promising plans, considering the latencies of LLM execution. Across four real-world document processing tasks, DocETL improves accuracy by 21–80% over strong baselines. DocETL is open-source at docetl.org and, as of March 2025, has over 1.7k GitHub stars, with users across diverse domains.
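
The idea of decomposing the data before extraction can be illustrated with a minimal sketch. This is not DocETL's API or its rewrite directives: every name below (`split_into_sections`, `find_indemnity_sentences`, `extract_all`, `SAMPLE`) is hypothetical, and a simple keyword match stands in for the LLM call so the example runs on its own. It assumes sections are separated by blank lines.

```python
import re
from typing import Callable, List


def split_into_sections(document: str) -> List[str]:
    """Decompose the data: split the document on blank lines into sections."""
    return [s.strip() for s in re.split(r"\n\s*\n", document) if s.strip()]


def find_indemnity_sentences(section: str) -> List[str]:
    """Hypothetical stand-in for one LLM extraction call on one section.

    A real pipeline would prompt a model here; this stub keeps sentences
    that mention indemnification so the sketch is runnable offline.
    """
    sentences = re.split(r"(?<=\.)\s+", section)
    return [s.strip() for s in sentences if re.search(r"indemnif", s, re.IGNORECASE)]


def extract_all(document: str, extract: Callable[[str], List[str]]) -> List[str]:
    """Run the extractor on each section, then merge the results.

    Duplicates are dropped while preserving first-seen order.
    """
    merged: List[str] = []
    for section in split_into_sections(document):
        for hit in extract(section):
            if hit not in merged:
                merged.append(hit)
    return merged


# Made-up sample text, used only to exercise the sketch.
SAMPLE = (
    "1. Payment. The Buyer shall pay within 30 days.\n\n"
    "2. Liability. The Seller shall indemnify the Buyer against third-party claims.\n\n"
    "3. Term. This agreement lasts one year. Each party shall indemnify the other for breach."
)

if __name__ == "__main__":
    for clause in extract_all(SAMPLE, find_indemnity_sentences):
        print(clause)
```

Each call sees only one section, so a relevant sentence is less likely to be lost in a long input; the merge step then combines the per-section results into a single list.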
